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Highlights 

• We describe entropy-scaling search for finding approximate matches 
in a database 

• Search complexity is bounded in time and space by the entropy of the 
database 

• We make tools that enable search of three largely intractable real-world 
databases 

• The tools dramatically accelerate metagenomic, chemical, and protein 
structure search 

eTOC Blurb 

We describe a general framework for efficiently searching massive datasets 
having certain properties common in biology. 

Summary 

Many datasets exhibit a well-defined structure that can be ex¬ 
ploited to design faster search tools, but it is not always clear when 
such acceleration is possible. Here, we introduce a framework for 
similarity search based on characterizing a dataset’s entropy and 
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fractal dimension. We prove that searching scales in time with 
metric entropy (number of covering hyperspheres), if the fractal 
dimension of the dataset is low, and scales in space with the sum 
of metric entropy and information-theoretic entropy (randomness 
of the data). Using these ideas, we present accelerated versions of 
standard tools, with no loss in specificity and little loss in sensi¬ 
tivity, for use in three domains—high-throughput drug screening 
(Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup 
of DIAMOND [3,700 x BLASTX]), and protein structure search 
(esFragBag, 10 x speedup of FragBag). Our framework can be 
used to achieve “compressive omics,” and the general theory can 
be readily applied to data science problems outside of biology. 

Source code: http ://gems.csail.mit. edu 


Introduction 

Throughout all areas of data science, researchers are confronted with in¬ 
creasingly large volumes of data. In many fields, this increase is exponential 
in nature, outpacing Moore’s and Kryder’s laws on the respective doublings 
of transistors on a chip and long-term data storage density (Kahn, 2011). 
As such, the challenges posed by the massive influx of data cannot be solved 
by waiting for faster and larger capacity computers but, instead, require 
instead the development of data structures and representations that exploit 
the complex structure of the dataset. 

Here, we focus on similarity search, where the task at hand is to find all 
entries in some database that are “similar,” or approximate matches, to a 
query item. Similarity search is a fundamental operation in data science and 
lies at the heart of many other problems, much like how sorting is a primi¬ 
tive operation in computer science. Traditionally, approximate matching has 
been studied primarily in the context of strings under edit distance metrics 
(Box 1) (e.g., for a spell-checker to suggest the most similar words to a mis¬ 
spelled word) (Ukkonen, 1985). Several approaches, such as the compressed 
suffix array and the FM-index (Grossi & Vitter, 2005; Ferragina & Manzini, 
2000), have been developed to accelerate approximate matching of strings. 
However, it has been demonstrated that similarity search is also important 
in problem domains where biological data are not necessarily represented 
as strings, including computational screening of chemical graphs (Schaeffer, 
2007) and searching protein structures (Budowski-Tal et ah, 2010). There¬ 
fore, approaches that apply to more general conditions are needed. 
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As available data grow exponentially (Berger et al., 2013; Yu et al., 
2015) (e.g., genomic data in Figure SI), algorithms that scale linearly (Box 
1) with the amount of data no longer suffice. The primary ways in which 
the literature addresses this problem—locality sensitive hashing (Indyk & 
Motwani, 1998), vector approximation (Ferhatosmanoglu et al., 2000), and 
space partitioning (Weber et al., 1998)—involve the construction of data 
structures that support more efficient search operations. However, we note 
that, as biological data increase, not only does the redundancy present in 
the data also increase (Loh et al., 2012), but also internal structure (such 
as the fact that not all conceivable configurations, e.g. all possible protein 
sequences, actually exist) also becomes apparent. Existing general-purpose 
methods do not explicitly exploit the particular properties of biological data 
to accelerate search (see the Theory section in the Supplemental Methods). 

Previously, our group demonstrated how redundancy in genomic data 
could be used to accelerate local sequence alignment. Using an approach that 
we called “compresive genomics,” we accelerated BLAST and BLAT (Kent, 
2002) by taking advantage of high redundancy between related genomes us¬ 
ing link pointers and edit scripts to a database of unique sequences (Loh 
et al., 2012). We have used similar strategies to obtain equally encouraging 
results for local alignment in proteomics (Daniels et al., 2013). Empirically, 
this compressive acceleration appears to scale almost linearly in the entropy 
of the database, often resulting in orders of magnitude better performance; 
however, these previous studies neither proved complexity bounds nor es¬ 
tablished a theory to explain these empirical speedups. 

Here, we generalize and formalize this approach by introducing a frame¬ 
work for similarity search of omics data. We prove that search performance 
primarily depends on a measure of the novelty of new data, also known as 
entropy. This framework, which we call entropy-scaling search, supports the 
creation of a data structure that provably scales linearly in both time and 
space with the entropy of the database, and thus sublinearly with the entire 
database. 

We introduce two key concepts for characterizing a dataset: metric en¬ 
tropy and fractal dimension. Intuitively, metric entropy measures how dis¬ 
similar the dataset is from itself, and fractal dimension measures how the 
number of spheres needed to cover all points in a database scales with the 
radii of those spheres. Both are rigorously defined later, but note that 
metric entropy is not to be confused with the notion of a distance metric 
(Box 1). Using these two concepts, we show that, if similarity is defined 
by a metric-like distance function (e.g., edit or Hamming distance) and 
the database exhibits both low metric entropy and fractal dimension, the 
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entropy-scaling search performs much better than nave and even optimized 
methods. Through three applications to large databases in chemogenomics, 
metagenomics, and protein structure search, we show that this framework 
allows for minimal (or even zero) loss in recall, coupled with zero loss in 
specificity. The key benefit of formulating entropy-scaling search in terms 
of metric entropy and fractal dimension is that this allows us to provide 
mathematically rigorous guidance as to how to determine the efficacy of the 
approach for any dataset. 


• Edit distance: the number of edits (character insertions, deletions, or substi¬ 
tutions) needed to turn one string into another. 

• Scale, in time and space: the amount of time or space a task takes as a 
function of the amount of data on which it must operate. A task requiring 
time directly proportional to the size of the data is said to scale linearly; for 
example, searching a database takes twice as long if the database grows by 
a factor of two. 

• Distance metric: a measure of distance that obeys several mathematical prop¬ 
erties, including the triangle inequality. 

• Covering spheres: a set of spheres around existing points so that every point 
is contained in at least one sphere and no sphere is empty. 

• Metric entropy: a measure of how dissimilar a dataset is from itself. Defined 
as the number of covering spheres. 

• Fractal dimension: a measure of how the number of points contained within 
a sphere scales with the radius of that sphere. 

• Information-theoretic entropy: often used in data compression as shorthand 
for the number of bits needed to encode a database or a measure of the 
randomness of that database. 

• Pattern matching: refers to searching for matches that might differ in specific 
ways from a query, such as wildcards or gaps, as opposed to searching for 
all database entries within a sphere of a specified radius as defined by an 
arbitrary distance function. 


Box 1: Definitions 
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Figure 1: Entropy-scaling framework for similarity search. (A-D) As 
shown, (A) The naive approach tests each query against each database entry 
to find entries within distance r of the query (inside the small green disc). 
(B) By selecting appropriate cluster centers with maximum radius r c to par¬ 
tition the database, we can (C) first do a coarse search to find all cluster 
centers within distance r + r c of a query (larger green disc), and then the 
(D) triangle inequality guarantees that a fine search over all corresponding 
cluster entries (blue polygonal regions) will suffice. 














Results 


Entropy-scaling similarity search 

The basic framework for the entropy-scaling search of a database involves 
four steps. (1) Analyze the database to define a high-dimensional space 
and determine how to map database entries onto points in this space (this 
mapping may be one-to-one). (2) Use this space and a measure of similarity 
between points to group entries in the database into clusters. (3) To search 
for a particular query item, perform a coarse-grained search to identify the 
clusters that could possibly contain the query. (4) Do a fine-grained search 
of the points contained within these clusters to find the closest matches to 
the query (Figure 1). 

Here, we provide conceptual motivation for this process. In the following 
text, we consider entropy to be nearly synonymous with distance between 
points in a high-dimensional space; thus, with low entropy, newly added 
points do not tend to be far from all existing points. For genomic sequences, 
the distance function can be edit distance; for chemical graphs, it can be 
Tanimoto distance; and for general vectors, it can be Euclidean or cosine 
distance. We are interested in the similarity search problem of finding all 
points in a set that are close to (i.e., similar to) the query point. 

Let us first consider what it means for a large biological dataset, consid¬ 
ered as points in a high-dimensional space, to be highly redundant. Perhaps 
many of the points are exact duplicates; this easy scenario is trivially ex¬ 
ploited by de-duplication and is already standard practice with datasets such 
as the NCBI’s non-redundant (NR) protein database (Pruitt et ah, 2005). 
Maybe the points mostly live on a low-dimensional subspace; statistical tools 
such as Principal Component Analysis (PCA) exploit this property in data 
analysis. Furthermore, if the dimension of the subspace is sufficiently low, it 
can be divided into cells, allowing quick similarity searches by looking only 
at nearby cells (Weber et al., 1998). However, when the dimensionality of 
the subspace increases, cell search time grows exponentially; additionally, in 
sparse datasets, most of the cells will be empty, which wastes search time. 

More importantly, biological datasets generally do not live in low-dimensional 
subspaces. Consider the instructive case of genomes along an evolutionary 
“tree of life” (Figure 2). Such a tree has many branches (although admixture 
merges branches back together), and looks nearly one-dimensional locally, 
but it is globally of higher dimension. Additionally, because of differences 
due to mutation, each of the branches is also “thick” (high dimensional) 
when looked at closely. Viewing this example as a low-dimensional sub¬ 
space, as in PCA, is incorrect. 
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Figure 2: Cartoon depiction of points in a high-dimensional space. This 
cartoon depics points in an arbitrary high-dimensional space that live close 
to a one-dimensional tree-like structure, as might arise from genomes gener¬ 
ated by mutation and selection along an evolutionary tree of life. Although 
high-dimensional at a fine scale, at the coarser scale of covering spheres, the 
data cloud looks nearly one-dimensional, which enables entropy-scaling of 
similarity search. The cluster center generation was performed using the 
same method we used for protein structure search. The blue circles around 
the green query point illustrate low fractal dimension: the larger-radius cir¬ 
cle contains only linearly more points than the smaller one, rather than 
exponentially more. In contrast, the red circles around the orange query 
point illustrate higher local fractal dimension. 



However, the local low-dimensionality can be exploited by looking on the 
right scales: a coarse scale in which the tree looks one-dimensional locally 
and a fine scale where the branch width matters. We cover the tree with 
spheres (Box 1) of radius r c , where r c is on the order of the branch width; 
these spheres determine our clusters, and the number of them is the metric 
entropy of the tree (Tao, 2008). Because all the points within a sphere are 
close to each other, they are highly redundant and can be encoded in terms 
of one another, saving space. 

By the triangle inequality, in order to search for all points within distance 
r of a query, we need only to look in nearby spheres with centers (i.e., 
representatives) within a distance r + r c of the query (Figure ID). However, 
because each sphere has a radius comparable to branch width, the tree is 
locally one-dimensional on the coarse scale; that is, spheres largely tend to 
extend along the branches of the tree rather than in all directions. We will 
call this property of local scaling the fractal dimension d of the tree at the 
scale r c (Falconer, 1990), where r c is essentially our ruler size and d = 1. 
Thus, increasing the search radius for coarse search only linearly increases 
the number of points that need to be searched in a fine search. 

A similar analysis holds in the more general case where d / 1. The 
entropy-scaling frameworks we introduce can be expected to provide a boost 
to approximate search when fractal dimension d of a dataset D is low (i.e., 


close to 1) and metric entropy k is low. Specifically, the ratio 


D 

T 


provides 


an estimate of the acceleration factor for just the coarse search component 
compared to a full linear search of a database D. Local fractal dimension 
around a data point can be computed by determining the number of other 
data points within two radii r\ and ?’2 of that point; given those point counts 

(ni and ri 2 , respectively), fractal dimension d is simply d = - ^ . 

log(r 2 /ri) 

Sampling this property over a dataset can provide a global average fractal 
dimension. When we search a larger radius around a query, the number 
of points we encounter grows exponentially with the fractal dimension; low 
fractal dimension implies that this growth will not obviate the gains provided 
by an entropy-scaling data structure. 

More formally, given a database with fractal dimension d and metric 
entropy k at the scale r c , we show in the Supplemental Methods that the 
time-complexity of similarity search on database D for query q with radius 



r is 


( output size . 

+15/teoi(%^) ). 

metric entropy ^ ✓ / 

scaling factor 

Thus, for small fractal dimension and output size, similarity search is asymp¬ 
totically linear in metric entropy. Additionally, because the search has to 
look at only a small subset of the clusters, the clusters can be stored in 
compressed form, and only decompressed as needed, giving space savings 
that also scale with entropy. The space-complexity scales with the sum of 
metric and information-theoretic entropy, rather than just metric entropy 
(Supplemental Methods: Theory). 

Practical application of entropy-scaling search 

We have presented the simplest such data to analyze for clarity of ex¬ 
position. However, real data is generally messier. Sometimes the distance 
function is not a metric, so we lose the triangle inequality guarantee of 100% 
sensitivity; sometimes different distance functions can be used for the clus¬ 
tering versus search; and sometimes even what counts as a distinct data 
point is not entirely clear without domain knowledge (for example, long 
genomic sequences might be better broken into shorter subsequences). 

To show that entropy-scaling frameworks are robust to the variations 
presented by real data, we explored a diversity of applications from three 
major biological “big challenges of big data”—pharamaceuticals, metage¬ 
nomics, and protein structure (Marx, 2013). We demonstrate that the gen¬ 
eral scheme results in order-of-magnitude improvements in running time in 
these different contexts, promising to enable new workflows for practition¬ 
ers (e.g., fast first-pass computational drug screens and local analyses of se¬ 
quencing data in remote field sites for real-time epidemic monitoring). These 
applications are enabled by augmenting the framework with domain-specific 
distance functions in different stages of the process, as well as preprocessing 
to take advantage of domain-specific knowledge. We expect that as long as 
the dataset exhibits both low entropy and low fractal dimension—and this 
is empirically true in biological systems—our entropy-scaling framework has 
the potential to achieve massive speedup over more naive methods and sig¬ 
nificant speedup even over other highly optimized methods. 

Source code for the applications discussed here is available at http: 
//gems. csail .mit. edu and in the Supplemental Information. 
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Application to high-throughput drug screening 

Chemogenomics is the study of drug and target discovery by using chem¬ 
ical compounds to probe and characterize proteomic functions (Bredel & 
Jacoby, 2004). Particularly in the held of drug discovery and drug repur¬ 
posing, prediction of biologically active compounds is a critical task. Com¬ 
putational high-throughput screening can eliminate many compounds from 
wet-lab consideration, but even this screening can be time-consuming. Pub- 
Chern (Bolton et al., 2008), a widely-used repository of molecular compound 
structures, has grown greatly since 2008. In July 2007, PubChem contained 
10.3 million compounds. In October 2013, PubChem contained roughly 47 
million compounds, while in December 2014 it contained 61.3 million com¬ 
pounds. 

We designed this compression and search framework around one of the 
standard techniques for high-throughput screening of potential drug com¬ 
pounds, the use of maximum common subgraph (MCS) to identify simi¬ 
lar motifs among molecules (Cao et ah, 2008; Rahman et ah, 2009). We 
introduce Ammolite, a method for clustering molecular databases such as 
PubChem, and for quickly searching for similar molecular structures in com¬ 
pressed space. Ammolite demonstrates that entropy-scaling methods can be 
extended to data types that are not inherently sequence based. Ammolite 
is a practical tool that provides approximately a factor of 150 speed-up 
with greater than 92% accuracy compared to the popular small molecular 
subgraph detector (SMSD) (Rahman et al., 2009). 

An MCS-based search of molecule databases typically matches pairs of 
molecules by Tanimoto distance (Rahman et al., 2009). Tanimoto distance 
obeys the triangle inequality and is more useful in the domain of molecular 
graphs than other distance metrics such as graph distance (Bunke & Shearer, 
1998). 

To compress a molecule database, we project the space of small molecules 
onto a subspace by removing nodes and edges that do not participate in sim¬ 
ple cycles (Figure S2); note that a molecule without cycles will collapse to 
a single node. Clusters are exactly pre-images of this projection operator 
(i.e., all molecules that are isomorphic after simplification form a cluster). 
Coarse search is performed by finding the MCS on this much smaller pro¬ 
jection subspace. This step increases speed by reducing both the required 
number of MCS operations and the time required for each MCS operation, 
which scales with the size of the molecule. Further reduction in search 
time is accomplished by grouping clusters according to size of the molecules 
within; because Tanimoto distance relies on molecule size, clusters contain- 
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ing molecules significantly larger or smaller then the query need not be 
searched at all. 

The time required to cluster a large database such as PubChem is, 
nonetheless, significant; clustering the 306-GB PubChem required approxi¬ 
mately 400 hours on a 12-core Xeon X5690 running at 3.47GHz, and required 
128 GB RAM. However, this database can easily be appended to as new 
molecules become available, and the clustering time can be amortized over 
future queries. It is worth noting that the preprocessing of molecular graphs 
can cause the triangle inequality to be violated; while the distance function 
is a metric, the clustering does not respect that metric. Ammolite can 
be readily incorporated into existing analysis pipelines for high-throughput 
drug screening. 

Our entropy-scaling framework can be applied to PubChem because it 
has both low fractal dimension and low metric entropy. In particular, we 
determined the mean local fractal dimension of PubChem to be approx¬ 
imately 0.2 in the neighborhood between 0.2 and 0.4 Tanimoto distance, 
and approximately 1.9 in the neighborhood between 0.4 and 0.5. The ex¬ 
pected speedup is measured by the ratio of database size to metric entropy, 
which, for PubChem is approximately 11:1. This is not taking into account 
the clustering according to molecule size, which further reduces the search 
space. 

Because SMSD is not computationally tractable on the entire PubChem 
database, we benchmarked Ammolite against SMSD on a subset of 1 mil¬ 
lion molecules from PubChem. Since SMSD’s running time should scale 
linearly with the size of the database, we extrapolated the running time 
of SMSD to the entire PubChem database. Benchmarking Ammolite and 
SMSD required 60GB RAM and used 12 threads, although Ammolite’s 
search, used normally, requires < 20 GB RAM. For these benchmarks, we 
used five randomly-chosen query molecules with at least two rings (Pub¬ 
Chem IDs 1504670, 19170294, 28250541, 4559889, and 55484477), as well 
as five medically-interesting molecules chosen by hand (adenosine triphos¬ 
phate [atp], clindamycin, erythromycin, teixobactin, and thalidomide). We 
also used SMSD as a gold standard against which we measured Ammolite’s 
recall. 

Ammolite achieves an average of 92.5% recall with respect to SMSD 
(Table la). This recall is brought down by one poorly-performing compound, 
PubChem ID 1504670, with only 62.5% recall, but is otherwise over 80%. 
Furthermore, Ammolite’s speed gains with respect to SMSD grow as the 
database grows (Table lb). 
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Table 1: Benchmarks of Ammolite vs. SMSD on databases of (a) 1 million 
molecules and (b) 47 million molecules (all of PubChem) 

(a) Ammolite benchmark on database of 1 million molecules 

PubChem ID 

SMSD (hours) Ammolite (hours) 

Speedup Recall (%) 

5957 (atp) 

4.4 

0.14 

31 

81% 

446598 (clindamycin) 

18.7 

1.5 

11.7 

90% 

12560 (erythromycin) 

849.6 

3.0 

279.2 

91% 

86341926 (teixobactin) 

618.5 

2.3 

265.5 

100% 

5426 (thalidomide) 

48.9 

0.81 

60.4 

100% 

1504670 

8.1 

0.8 

10.3 

62.5% 

19170294 

31.3 

0.8 

39.7 

100% 

28250541 

43.3 

4.8 

9.0 

100% 

4559889 

108.8 

2.7 

41.0 

100% 

55484477 

23.3 

2.5 

9.1 

100% 

(b) Ammolite benchmark 

on entire PubChem database as of October 2013. See 



also Figure S2. 

PubChem ID 

Ammolite (hours) 

Speedup 

5957 (atp) 

4.1 

51.3 

446598 (clindamycin) 

28.4 

14.5 

12560 (erythromycin) 

79.1 

512.9 

86341926 (teixobactin) 

96.5 

305.9 

5426 (thalidomide) 

29.2 

80.0 

1504670 

4.6 

84.4 

19170294 

6.0 

247.4 

28250541 

38.9 

53.2 

4559889 

57.3 

90.7 

55484477 

35.5 

31.4 



Application to metagenomics 

Metagenomics is the study of genomic data sequenced directly from envi¬ 
ronmental samples. It has led to improved understanding of how ecosystems 
recover from environmental damage (Tyson et al., 2004) and how the hu¬ 
man gut responds to diet and infection (David et ah, 2014). Metagenonrics 
has even provided some surprising insights into disorders such as Autism 
Spectrum Disorder (MacFabe, 2012). 

BLASTX (Altschul et ah, 1990) is widely used in metagenonrics to 
map reads to protein databases such as KEGG (Kanehisa &; Goto, 2000) 
and NCBI’s NR (Sayers et ah, 2011). This mapping is additionally used 
as a primitive in pipelines such as MetaPhlAn (Segata et al., 2012), PI- 
CRUSt (Langille et ah, 2013), and MEGAN (Huson et ah, 2011) to de¬ 
termine the microbial composition of a sequenced sample. Unfortunately, 
BLASTX’s runtime requirements scale linearly with the product of the size 
of the full read dataset and the targeted protein database, and thus each 
year require exponentially more runtime to process the exponentially grow¬ 
ing read data. These computational challenges are at present a barrier 
to widespread use of metagenomic data throughout biotechnology, which 
constrains genomic medicine and environmental genomics (Frank &; Pace, 
2008). For example, Mackelprang et al. (2011) reported that using BLASTX 
to map 246 million reads against KEGG required 800,000 CPU hours at a 
supercomputing center. 

Although this is already a problem for major research centers, it is es¬ 
pecially limiting for on-site analyses in more remote locations. In surveying 
the 2014 Ebola outbreak, scientists physically shipped samples on dry ice to 
Harvard for sequencing and analysis (Gire et al., 2014). Even as sequencers 
become more mobile and can thus be brought on site, lack of fast Internet 
connections in remote areas can make it impossible to centralize and ex¬ 
pedite processing (viz., the cloud); local processing on resource-constrained 
machines remains essential. Thus, a better-scaling and accurate version 
of BLASTX raises the possibility of not only faster computing for large re¬ 
search centers, but also of performing entirely on-site sequencing and desktop 
metagenomic analyses. 

Recently, approaches such as RapSearch2 (Zhao et al., 2012) and Dia¬ 
mond (Buchfink et al., 2015) have provided faster alternatives to BLASTX. 
We have applied our entropy-scaling framework to the problem of metage¬ 
nomic search and demonstrate MICA, a method whose software implementa¬ 
tion provides an acceleration of DIAMOND by a factor of 3.5, and BLASTX 
by a factor of up to 3700. This application illustrates the potential of 
entropy-scaling frameworks, while providing a useful tool for metagenomic 
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research. It can be readily incorporated into existing analysis pipelines (e.g., 
for microbial composition analysis using MEGAN). MICA clustering of the 
September 17, 2014 NCBI NR database (containing 49.3 million sequences) 
required 39 hr on a 12-core Xeon X5690 running at 3.47GHz; it used ap¬ 
proximately 84 GB of resident memory. 

Our entropy-scaling framework can be applied to the NCBI’s NR database 
because it, like PubChem, exhibits low fractal dimension and metric entropy. 
We determined the mean local fractal dimension of the NCBI’s NR database, 
using sequence identity of alignment as a distance function, to be approx¬ 
imately 1.6 in the neighborhood between 70% and 80% protein sequence 
identity. The ratio of database size to metric entropy, which gives an indi¬ 
cator of expected speedup, is approximately 30:1. Indeed, the notion that 
protein sequence space exhibits structure, and lends itself to clustering, has 
precedent (Linial et ah, 1997). 

To evaluate the runtime performance of MICA, we tested it against 
BLASTX, RapSearch2 (Zhao et al., 2012) and Diamond (Buchfink et ah, 
2015). On five read sets (ERR335622, ERR335625, ERR335631, ERR335635, 
ERR335636) totalling 207,623 151-nucleotide (nt) reads from the American 
Gut Microbiome project, we found that MICA provides measurable runtime 
improvements over DIAMOND with no further loss in accuracy (Table 2a), 
and substantial runtime improvements over BLASTX. Notably, the mean 
running time for BLASTX was 58,215 minutes, while MICA took an av¬ 
erage of 15.6 minutes, a speedup of 3,724x. MICA uses DIAMOND for its 
coarse search, and can use either DIAMOND or BLASTX for its fine search. 

We also evaluated MICA using BLASTX for both the coarse and the fine 
search; this approach performed slightly slower than DIAMOND, requiring 
an average of 89 min, though it was somewhat more accurate, at 95.9% 
recall compared to DIAMOND’S 90.4% recall. MICA using BLASTX for 
both coarse and fine searches relied on a query-side clustering (discussed 
in Supplemental Methods); we note that the time spent performing query- 
side clustering is included here; without query-side clustering, this variant 
of MICA takes 2,278 min, a speedup of 25x over BLASTX. 

MICA accelerates DIAMOND with no further loss in accuracy: 90.4% 
compared to unaccelerated BLASTX Table 2b. Experiments validating ac¬ 
curacy treated BLASTX as a gold standard. Since MICA accelerates DI¬ 
AMOND using entropy-scaling techniques, false positives with respect to 
DIAMOND are not possible, but false negatives are. We report as accuracy 
the fraction of BLASTX hits that are also returned by MICA. 

DIAMOND’S clever indexing and alphabet reduction provide excellent 
runtime performance already, though its running time still scales linearly 
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Table 2: (a) Running time and (b) accuracy of BLASTX, RapSearch2, 
DIAMOND, and MICA. Data set is the American gut microbiome project 
read sets ERR335622, ERR335625, ERR335631, ERR335635, ERR335636 


(a) Running time in minutes (standard deviation) 

BLASTX RapSearch2 DIAMOND MICA-DIAMOND MICA-BLASTX 

58215 (1561.8) 206 (5.4) 54 (1.1) 15.6 (0.5) 21.9 (1.7) 

(b) Accuracy against BLASTX (standard deviation) 

RapSearch2 DIAMOND MICA-DIAMOND MICA-BLASTX 
79.5% (1.63) 90.4% (3.10) 90.4% (3.10) 90.4% (3.10) 


with database size. In contrast, as an entropy-scaling search, MICA will 
demonstrate greater acceleration as database sizes grow (Daniels et al., 
2013). Moreover, MICA can use standard BLASTX for its fine search, which 
allows the user to pass arbitrary parameters to the underlying BLASTX call 
but which also comes at a small runtime penalty (40% in our testing). This 
option allows for additional BLAST arguments that DIAMOND does not 
support, such as XML output, which may be useful in some pipelines. Thus, 
MICA with BLASTX may be suitable for a wider variety of existing analysis 
pipelines. 

Application to protein structure search 

The relationship between protein structure and function has been a sub¬ 
ject of intense study for decades, and this strong link has been used for the 
prediction of function from structure (Hegyi &; Gerstein, 1999). Specifically, 
given a protein of solved (or predicted) structure but unknown function, the 
efficient identification of structurally similar proteins in the Protein Data 
Bank (PDB) is critical to function prediction. Finding structural neigh¬ 
bors can also give insight into the evolutionary origins of proteins of interest 
(Yona et ah, 1999; Nepomnyachiy et al., 2014). 

One approach to finding structural neighbors is to attempt to align the 
query protein to all the entries in the PDB using a structural aligner, such 
as STRUCTAL (Subbiah et al., 1993), ICE (Shindyalov & Bourne, 1998), or 
Matt (Menke et al., 2008). However, performing a full alignment against ev¬ 
ery entry in the PDB is prohibitively expensive, especially as the database 
grows. To mitigate this, (Budowski-Tal et al., 2010) introduced the tool 
FragBag, which avoids performing full alignments but rather describes each 
protein as a “bag of fragments,” where each fragment is a small structural 
motif. FragBag has been reported as comparable to structural aligners such 
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as STRUCTAL or ICE, and its bag-of-fragments approach allows it to per¬ 
form comparisons much faster than standard aligners. Importantly for us, 
the bag of fragments is just a frequency vector, making FragBag amenable 
to acceleration through entropy-scaling. 

By first verifying that the local fractal dimension of PDB FragBag fre¬ 
quency vectors is low in most regimes (d ~ 2 — 3), Figure S3), we are given 
reason to think that this problem is amenable to entropy-scaling search. As 
an estimate of potential speedup, the ratio of PDB database size to metric 
entropy at for the chosen cluster radii is on average, ~10:1. We directly ap¬ 
plied our entropy-scaling framework without any additional augmentation: 
esFragBag (entropy-scaling FragBag) is able to achieve an average factor 
of 10 speedup of the highly-optimized FragBag with less than 0.2% loss in 
sensitivity and no loss in specificity. 

For this last example, we intentionally approach the application of entropy¬ 
scaling frameworks to FragBag in a blind manner, without using any domain- 
specific knowledge. Instead, we use the very same representation (bag of 
fragments) and distance functions (Euclidean and cosine distances) as Frag¬ 
Bag, coupled with a greedy k-centers algorithm to generate the clustered 
representation. Note that this is in contrast to MICA and Ammolite, which 
both exploit domain knowledge to further improve performance. Thus, es¬ 
FragBag only involves extending an existing codebase with new database 
generation and similarity search functions. 

We investigate the increases in speed resulting from directly applying 
the entropy-scaling framework for both Euclidean and cosine distances and 
found the acceleration is highly dependent on both the search radius and 
cluster radius (Figure 3). For cosine distance, we generated databases with 
maximum cluster radii of 0.1, 0.2, 0.3, 0.4, and 0.5. Then, for each query 
protein from the set {4rhv, lake, lbmf, lrbp} (identified by PDB IDs), we 
ran both naive and accelerated similarity searches with radii of 0.02i,Vi £ 
{0,... ,49}. This test was repeated 5 times for each measurement, and the 
ratio of average accelerated vs naive times is shown in Figure 3a. 

For Euclidean distance, we generated databases with maximum cluster 
radii of 10, 20, 25, 50, and 100. Again, for each query protein drawn from 
the same set, we compared the average over five runs of the ratio of average 
accelerated versus naive times (Figure 3b). The cluster generation required 
anywhere from 65 to 23,714 seconds, depending on the choice of radii (See 
table 3) and no more than a small constant (< 3) times as much memory 
as it takes to simply load the PDB database (no more than 2 GB RAM). 
Clustering used 20 threads on a 12-core Xeon X5690, while search used only 
one thread. 
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Table 3: Cluster generation time for esFragBag 
(a) Cosine distance: 


radius 

0.1 

0.2 

0.3 

0.4 

0.5 

time (s) 

21,037 

11,088 

7,409 

5,288 

3,921 


(b) Euclidean distance: 


radius 

10 

20 

30 

40 50 

time (s) 

23,714 

3,062 

483 

144 65 


Not only is the acceleration highly dependent on both the search radius 
r and the maximum cluster radius r c , but the choice of query protein also 
affects the results. We suspect that this effect is due to the geometry of pro¬ 
tein fragment frequency space being very “spiky” and “star-like”. Proteins 
that are near the core (and thus similar to many other proteins) show very 
little acceleration when our framework is used because the majority of the 
database is nearby, whereas proteins in the periphery have fewer neighbors 
and are thus found much more quickly. Changing the maximum cluster ra¬ 
dius effectively makes more proteins peripheral proteins, but at the cost of 
overall acceleration. 

Naturally, as the search radius expands, it quickly becomes necessary 
to compare against nearly the entire database, destroying any acceleration. 
For the cosine space in particular, note that the maximum distance between 
any two points is 1, so once the coarse search radius of r + r c > 1.0, there 
cannot ever be any acceleration as the fine search encompasses the entire 
database. Similarly, once the coarse search encompasses all (or nearly all) 
the clusters in Euclidean space, the acceleration diminishes to a factor 1, 
and the overhead costs make the entropy-scaling framework perform worse 
than a naive search. However, as we are most interested in proteins that 
are very similar to the query, the low-radius behavior is of primary interest. 
In the low-radius regime, esFragBag demonstrates varying though substan¬ 
tial acceleration (2-30x, averaging >10x for both distance functions for the 
proteins chosen) over FragBag. 

It is instructive to note that because of the very different geometries of 
Euclidean vs cosine space, acceleration varies tremendously for some pro¬ 
teins, such as 4rhv and lbmf, which display nearly opposite behaviors. 
Whereas there is nearly 30x acceleration for 4rhv in cosine space for low 
radius, and the same for lbmf in Euclidean space, neither achieves better 
than ~ 2.5x acceleration in the other space. 
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(a) Cosine distance 


(b) Euclidean distance 










Figure 3: Scaling behavior of esFragBag. EsFragBag benchmarking data 
with parameters varied until the acceleration advantage of esFragBag dis¬ 
appears. As search radius increases, the fraction of the database returned 
by the coarse search increases, ultimately returning the whole database. 
Unsurprisingly, when returning the whole database in the coarse search re¬ 
sults, there are no benefits to using entropy-scaling frameworks, (a) Cosine 
distance gives on the whole better acceleration, but results in > 99.8% sen¬ 
sitivity, whereas (b) Euclidean distance as a metric is guaranteed by the 
Triangle Inequality to get 100% sensitivity. 
































































































































Table 4: Average sensitivity of esFragBag compared to FragBag when using 
cosine distance for the trials described in Figure 3a. This table averages 
the sensitivities for each choice of search radii {0, 0.01,..., 0.49}. (NB: no 
analogous table is given for Euclidean distance as the Triangle Inequality 
ensures perfect recall). 


Query protein 

Cluster radii 

4rhv 

lake 

lbmf 

lrbp 

0.10 

1 

0.999840 

0.998490 

0.999950 

0.20 

1 

0.999918 

0.999001 

0.999978 

0.30 

1 

0.999926 

0.999649 

1 

0.40 

1 

0.999974 

0.999796 

1 

0.50 

1 

0.999984 

0.999934 

1 


Finally, while Euclidean distance is a metric—for which the triangle 
inequality guarantees 100% sensitivity—cosine distance is not. Empirically, 
however, for all of the queries we performed, we achieve > 99.8% sensitivity 
(Table 4). 


Application to other domains 

We anticipate that our entropy-scaling approach will be useful to other 
kinds of biological data sets; applying it to new data sets will require sev¬ 
eral steps. Here we provide a “cookbook” for applying our entropy-scaling 
framework to a new data set. Given a new data set, we first define what 
the high-dimensional space is. For metagenomic sequence data, it is the 
set of enumerable protein sequences up to some maximum length, while for 
small-molecule data, it is the set of connected chemical graphs up to some 
maximum size, and for protein structure data (using the FragBag model) it 
is the set of “bag-of-words” frequency vectors of length 400. 

Given the high dimensional space, we determine how database entries 
map onto points (for example, in the case of MICA, they are greedily broken 
into subsequences with a minimum length). Next, clustering can be imple¬ 
mented; a simple greedy clustering may suffice (as for esFragBag) but clus¬ 
tering of sequence data may be dramatically accelerated by using BLAST- 
style seed-and-extend matching (as used in MICA). Finally, coarse and fine 
search can be implemented; in many cases, existing tools may be used “out of 
the box,” as with esFragBag and MICA. With MICA, we note that coarse 
search by default uses DIAMOND, while fine search provides a choice of 
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DIAMOND or BLASTX. With Ammolite, we used the SMSD library, but 
incorporated it into our own search tool. 


Discussion 

We have introduced an entropy-scaling framework for accelerating ap¬ 
proximate search, allowing search on large omics datasets to scale, even as 
those datasets grow exponentially. The primary advance of this framework 
is that it bounds both time and space as functions of the dataset entropy 
(albeit using two different notions of entropy: metric entropy bounds time, 
while information-theoretic entropy bounds space). We proved that run¬ 
time scales linearly with the entropy of the database, but we also show 
(Supplemental Methods: Theory) that under certain additional constraints, 
this entropy-scaling framework permits a compressed representation on disk. 
This compression is particularly applicable in the case of metagenomic anal¬ 
ysis, where the collection of read data presents a major problem for storage 
and transfer. Although we did not optimize for on-disk compression in any 
of our applications, choosing instead to focus on search speed, implementing 
this compression is feasible using existing software tools and libraries such 
as Blocked GZip (BZGF); each cluster would be compressed separately on 
disk. 

Furthermore, we have justified and demonstrated the effectiveness of 
this framework in three distinct areas of computational molecular biology, 
providing the following open-source software: Ammolite for small-molecule 
structure search, MICA for metagenomic analysis, and esFragBag for protein 
structure search. All of our software is available under the GNU General 
Public License, and not only can the tools we are releasing be readily plugged 
into existing pipelines, but the code and underlying methods can also be 
easily incorporated into the original software that we are accelerating. 

The reason for the speedup is the combination of low fractal dimension 
and low metric entropy. Low fractal dimension ensures that runtime is 
dominated by metric entropy. The size of the coarse database provides an 
estimate of metric entropy. Furthermore, we can directly measure the local 
fractal dimension of the database by sampling points from the database and 
looking at the scaling behavior of the number of points contained in spheres 
of increasing radii centered on those sampled points. We have shown that 
for three domains within biological data science, metric entropy, and fractal 
dimension are both low. 

As discussed in the theoretical results, although the data live locally on 
a low dimension subspace, the data are truly high-dimensional globally. At 
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small scales, biological data often lives on a low-dimensional polytope (Hart 
et al., 2015). However, omics data are, by nature, comprehensive and include 
not just one but many such polytopes. Although each polytope can be 
individually projected onto a subspace using techniques such as PCA, the 
same projection cannot be used for all the polytopes at once because they 
live on different low-dimensional subspaces. Furthermore, as is the case 
with genomes, the low-dimensional polytopes are also often connected (e.g., 
through evolutionary history). Thus, collections of local projections become 
unwieldy. By using our clustering approach, we are able to take advantage 
of the existence of these low-dimensional polytopes for accelerated search 
without having to explicitly characterize each one. 

A hierarchical clustering approach, rather than our flat clustering, has 
the potential to produce further gains (Loh et ah, 2012). We have taken the 
first steps in exploring this idea here; the molecule size clustering in Am- 
molite can be thought of as an initial version of a multi-level or hierarchical 
clustering. 

Entropy-scaling search is related to succinct, compressed, and oppor¬ 
tunistic data structures, such as the compressed suffix array, the FM-index, 
and the sarray (Grossi & Vitter, 2005; Ferragina & Manzini, 2000; Conway 
& Bromage, 2011). However, these solve the problem of theoretically fast 
and scalable pattern matching (Box 1), whereas we solve, theoretically and 
practically, the much more general similarity search problem. An entropy¬ 
scaling search tree is also related to a metric ball tree (Uhlmann, 1991), 
although with different time complexity. Querying a metric ball tree re¬ 
quires O(logn) time, assuming the relatively uniform distribution of data 
points in a metric space. This distribution differs from the non-uniform dis¬ 
tribution under which entropy-scaling search behaves well. In future work, 
we will investigate further acceleration of coarse search by applying a metric 
ball tree to the cluster representatives themselves; this approach may reduce 
the coarse search time to O(logfc). This step, too, can be thought of as an 
additional level of clustering. 

Other metric search trees can also be found in the database literature 
(Zezula et al., 2006), although, to our knowledge, they have not been ex¬ 
plicitly applied to biological data science. The closest analogue to entropy¬ 
scaling search trees is the M-tree (Ciaccia et al., 1997, 1998), which resembles 
a multi-level variation of our entropy-scaling search trees. However, the Mi- 
tree time-complexity analysis (Ciaccia et al., 1998) does not have a nice 
closed form and is more explicitly dependent on the exact distribution of 
points in the database. By using and combining the concepts of metric en¬ 
tropy and fractal dimension for our analysis, we are able to give an easier to 
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understand and more intuitive, if somewhat looser, bound on entropy-scaling 
search tree complexity. 

Entropy-scaling frameworks have the advantage of becoming proportion¬ 
ately faster and space efficient with the size of the available data. Although 
the component pieces (e.g., the clustering method chosen) of the framework 
can be either standard (as in esFragBag) or novel (as in Ammolite), the 
key point is that these pieces are used in a larger framework to exploit 
the underlying complex structure of biological systems, enabling massive 
acceleration by scaling with entropy. We have demonstrated this scaling 
behavior for common problems drawn from metagenonrics, cheminformat- 
ics, and protein structure search, but the general strategy can be applied 
directly or with simple domain knowledge to a vast array of other problems 
faced in data science. We anticipate that entropy-scaling frameworks should 
be applicable beyond the life sciences, wherever physical or empirical laws 
have constrained data to a subspace of low entropy and fractal dimension. 


Methods 

Ammolite small molecule search 

Ammolite’s clustering approach relies on structural similarity. We aug¬ 
mented the entropy-scaling data structure by using a clustering scheme 
based on molecular structural motifs instead of a distance function. Each 
molecule is “simplified” by removing nodes and edges that do not participate 
in simple cycles. Clusters are formed of molecules that are isomorphic after 
this simplification step. Each cluster can then be represented by a single 
molecular structure, along with pointers to “difference sets” between that 
structure and each of the full molecules in the cluster it represents. For both 
coarse and fine search, we use the Tanimoto distance metric, defined as 

|mcs(G'i, 

G 1 \ + \G 2 \-\mcs(G 1 ,G 2 )y 

where mcs refers to the maximum common subgraph of two chemical graphs. 

The coarse search is performed in compressed space, by searching the 
coarse database with the goal of identifying possible hits. The query molecule 
is simplified in exactly the same manner as the molecular database during 
clustering, and this transformed query graph is matched against the coarse 
database. To preserve sensitivity, this coarse search is performed with a 
permissive similarity score. Any possible hits—molecular graphs from the 
coarse database whose MCS to the transformed query molecule was within 
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the similarity score threshold—are then reconstructed by following point¬ 
ers to the removed atom and bond information and recreating the original 
molecules. Since the Tanimoto distance is used, we can bound the size of 
candidate molecules based on the size of the query molecule and the desired 
Tanimoto cutoff. Thus, a second level of clustering, at query time, based 
on molecule size, allows further gains in runtime performance. Finally, the 
fine search is performed against these decompressed possible hits that are 
within the appropriate size range based on the Tanimoto distance cutoff. 

MICA metagenomic search 

CaBLASTX’s clustering approach relies on sequence similarity. We aug¬ 
mented the entropy-scaling data structure by using different distance func¬ 
tions for clustering and search. For clustering, we rely on sequence identity, 
while for search, we use the E-value measure that is standard for BLAST. 
All benchmarks were performed with an E-value of 10 - '. For coarse search, 
MICA uses the DIAMOND argument —top 60 in order to return all queries 
with a score within 60% of the top hit. When MICA was tested using 
BLASTX for coarse search, it used an E-value of 1000. This seemingly sur¬ 
prisingly large coarse E-value is used because E-values are poorly behaved 
for short sequences; in sensitivity analysis, coarse E-values of 1 and 10 ex¬ 
hibited recall below 10%, and an E-value of 100 exhibited recall below 60%. 
Furthermore, during clustering (compression), we apply a preprocessing step 
that identifies subsequences to be treated as distinct points in the database. 
We apply a reversible alphabet reduction to the protein sequences, which 
projects them into a subspace (Supplemental Methods). 

When applied to high-coverage next-generation sequencing queries, ca- 
BLASTX can also perform clustering on the reads (Supplemental Methods). 
In this instance, coarse search is performed by matching each representa¬ 
tive query with a set of representative database entries. Fine search then 
matches the original queries within each cluster with the candidate database 
entries resulting from the coarse search. 

esFragBag protein structure search 

In FragBag, the bag of fragments is essentially a term frequency vector 
representing the number of occurrences of each structural motif within the 
protein. FragBag turns out to be amenable to acceleration using an entropy¬ 
scaling data structure because much of the computation is spent in doing a 
similarity search on that frequency vector. 

For the cluster generation, we trivially used a naive randomized greedy 
2-pass approach. First, all proteins in the Protein Data Bank were randomly 
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ordered. Then in the first pass, proteins were selected as cluster centers if 
and only if they were not within a user-specified Euclidean distance r c from 
an existing center (i.e., the first protein is always selected, and the second 
if further away than r c from the first, etc.). Recall that this generation of 
cluster centers is the same as the one used to generate covering spheres in 
Figure 2; the covering spheres were overlapping, but we assign every protein 
uniquely to a single cluster by assigning to the nearest cluster center in the 
second pass. 

Similarity search here was performed exactly as described in the sec¬ 
tion “Entropy-Scaling Similarity Search”, with no modification. For a given 
search query q and search radius r, a coarse search was used to find all clus¬ 
ter centers within distance r + r c of q. Then, all corresponding clusters were 
unioned into a set F. Finally, a fine search was performed over the set F to 
find all proteins within distance r of q. 


Author Contributions 

Y.W.Y., N.M.D., and B.B. conceived the project. Y.W.Y., N.M.D., 
and B.B. developed the theoretical analyses. N.M.D. and D.C.D. imple¬ 
mented and benchmarked MICA. D.C.D. implemented and benchmarked 
Ammolite, with help from N.M.D. and Y.W.Y. Y.W.Y. implemented and 
benchmarked esFragBag, with help from N.M.D. B.B. guided all research 
and provided critical advice on the study. Y.W.Y., N.M.D. and B.B. wrote 
the manuscript. 


Acknowledgments 

Y.W.Y. is supported by a Hertz Foundation fellowship. N.M.D. and 
B.B. are supported by NIH GM108348. We thank Andrew Gallant for his 
implementation of Fragbag. We thank Joseph V. Barrile for graphic design. 
We thank Jian Peng for suggesting high-throughput drug screening as an 
application. 


References 

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). 
Basic local alignment search tool. Journal of molecular biology, 215, 403- 
410. 


24 



Berger, B., Peng, J., & Singh, M. (2013). Computational solutions for ornics 
data. Nature Reviews Genetics, 14, 333-346. 

Bolton, E. E., Wang, Y., Thiessen, P. A., &; Bryant, S. H. (2008). Pubchem: 
integrated platform of small molecules and biological activities. Annual 
reports in computational chemistry, 4, 217-241. 

Bredel, M., & Jacoby, E. (2004). Chemogenomics: an emerging strategy for 
rapid target and drug discovery. Nature Reviews Genetics, 5, 262-275. 

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein 
alignment using DIAMOND. Nature methods, 12, 59-60. 

Budowski-Tal, I., Nov, Y., &: Kolodny, R. (2010). FragBag, an accurate 
representation of protein structure, retrieves structural neighbors from the 
entire PDB quickly and accurately. Proceedings of the National Academy 
of Sciences, 107, 3481-3486. 

Bunke, H., & Shearer, K. (1998). A graph distance metric based on the 
maximal common subgraph. Pattern recognition letters, 19, 255-259. 

Cao, Y., Jiang, T., & Girke, T. (2008). A maximum common substructure- 
based algorithm for searching and predicting drug-like compounds. Bioin¬ 
formatics, 24, i366-i374. 

Ciaccia, P., Patella, M., & Zezula, P. (1997). Deis-csite-cnr. In Proceedings 
of the... International Conference on Very Large Data Bases (p. 426). 
Morgan Kaufmann Pub volume 23. 

Ciaccia, P., Patella, M., Sz Zezula, P. (1998). A cost model for similarity 
queries in metric spaces. In Proceedings of the seventeenth ACM SIGACT- 
SIGMOD-SIGART symposium on Principles of database systems (pp. 59- 
68). ACM. 

Conway, T. C., & Bromage, A. J. (2011). Succinct data structures for 
assembling large genomes. Bioinformatics, 27, 479-486. 

Daniels, N. M., Gallant, A., Peng, J., Cowen, L. J., Bayrn, M., &; Berger, 
B. (2013). Compressive genomics for protein databases. Bioinformatics, 
29, i283-i290. 

David, L. A., Materna, A. C., Friedman, J., Campos-Baptista, M. I., Black¬ 
burn, M. C., Perrotta, A., Erdman, S. E., & Aim, E. J. (2014). Host 
lifestyle affects human microbiota on daily timescales. Genome Biol, 15, 
R8. 


25 



Falconer, K. (1990). Fractal geometry: mathematical foundations and appli¬ 
cations. John Wiley & Sons. 

Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., & El Abbadi, A. (2000). Vec¬ 
tor approximation based indexing for non-uniform high dimensional data 
sets. In Proceedings of the ninth international conference on Information 
and knowledge management (pp. 202-209). ACM. 

Ferragina, P., &; Manzini, G. (2000). Opportunistic data structures with 
applications. In Foundations of Computer Science, 2000. Proceedings. 
41st Annual Symposium on (pp. 390-398). IEEE. 

Frank, D. N., & Pace, N. R. (2008). Gastrointestinal microbiology enters 
the metagenomics era. Current opinion in gastroenterology, 24, 4 10. 

Gire, S. K., Goba, A., Andersen, K. G., Sealfon, R. S., Park, D. J., Kanneh, 
L., Jalloh, S., Momoh, M., Fullah, M., Dudas, G. et al. (2014). Genomic 
surveillance elucidates ebola virus origin and transmission during the 2014 
outbreak. Science, 345, 1369-1372. 

Grossi, R., & Vitter, J. S. (2005). Compressed suffix arrays and suffix trees 
with applications to text indexing and string matching. SIAM Journal on 
Computing, 35, 378-407. 

Hart, Y., Sheftel, H., Hausser, J., Szekely, P., Ben-Moshe, N. B., Korern, 
Y., Tendler, A., Mayo, A. E., & Alon, U. (2015). Inferring biological 
tasks using pareto analysis of high-dimensional data. Nature methods, 
12, 233-235. 

Hegyi, H., & Gerstein, M. (1999). The relationship between protein struc¬ 
ture and function: a comprehensive survey with application to the yeast 
genome. Journal of molecular biology, 288, 147-164. 

Huson, D. H., Mitra, S., Ruscheweyh, H.-J., Weber, N., &; Schuster, S. C. 
(2011). Integrative analysis of environmental sequences using megan4. 
Genome research, 21, 1552-1560. 

Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards 
removing the curse of dimensionality. In Proceedings of the thirtieth 
annual ACM symposium on Theory of computing (pp. 604-613). ACM. 

Kahn, S. D. (2011). On the future of genomic data. Science(Washington), 
331, 728-729. 


26 



Kanehisa, M., Sz Goto, S. (2000). KEGG: kyoto encyclopedia of genes and 
genomes. Nucleic acids research, 28, 27-30. 

Kent, W. J. (2002). BLAT-the BLAST-like alignment tool. Genome re¬ 
search, 12, 656-664. 

Langille, M. G., Zaneveld, J., Caporaso, J. G., McDonald, D., Knights, 
D., Reyes, J. A., Clemente, J. C., Burkepile, D. E., Thurber, R. L. V., 
Knight, R., Sz Huttenhower, C. (2013). Predictive functional profiling of 
microbial communities using 16S rRNA marker gene sequences. Nature 
biotechnology, 31, 814 821. 

Linial, M., Linial, N., Tishby, N., Sz Yona, G. (1997). Global self¬ 
organization of all known protein sequences reveals inherent biological 
signatures. Journal of molecular biology, 268, 539-556. 

Loh, P.-R., Baym, M., Sz Berger, B. (2012). Compressive genomics. Nature 
biotechnology, 30, 627-630. 

MacFabe, D. F. (2012). Short-chain fatty acid fermentation products of the 
gut microbiome: implications in autism spectrum disorders. Microbial 
ecology in health and disease, 23. 

Mackelprang, R., Waldrop, M. P., DeAngelis, K. M., David, M. M., Chavar¬ 
ria, K. L., Blazewicz, S. J., Rubin, E. M., Sz Jansson, J. K. (2011). Metage- 
nornic analysis of a permafrost microbial community reveals a rapid re¬ 
sponse to thaw. Nature, 480, 368-371. 

Marx, V. (2013). Biology: The big challenges of big data. Nature, 498, 
255-260. 

Menke, M., Berger, B., Sz Cowen, L. (2008). Matt: local flexibility aids 
protein multiple structure alignment. PLoS computational biology, 4, 
elO. 

Nepomnyachiy, S., Ben-Tal, N., Sz Kolodny, R. (2014). Global view of the 
protein universe. Proceedings of the National Academy of Sciences, 111, 
11691-11696. 

Pruitt, K. D., Tatusova, T., Sz Maglott, D. R. (2005). NCBI reference se¬ 
quence (RefSeq): a curated non-redundant sequence database of genomes, 
transcripts and proteins. Nucleic acids research, 33, D501-D504. 


27 



Rahman, S. A., Bashton, M., Holliday, G. L., Schrader, R., & Thornton, 

J. M. (2009). Small molecule subgraph detector (SMSD) toolkit. Journal 
of Cheminformatics, 1, 1-13. 

Sayers, E. W., Barrett, T., Benson, D. A., Bolton, E., Bryant, S. H., Canese, 

K. , Chetvernin, V., Church, D. M., DiCuccio, M., Federhen, S. et al. 
(2011). Database resources of the national center for biotechnology infor¬ 
mation. Nucleic acids research, 39, D38-D51. 

Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1, 
27-64. 

Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., & Hut- 
tenhower, C. (2012). Metagenomic microbial community profiling using 
unique clade-specific marker genes. Nature methods, 9, 811-814. 

Shindyalov, I. N., & Bourne, P. E. (1998). Protein structure alignment by 
incremental combinatorial extension (CE) of the optimal path. Protein 
engineering, 11, 739-747. 

Subbiah, S., Laurents, D., & Levitt, M. (1993). Structural similarity of 
DNA-binding domains of bacteriophage repressors and the globin core. 
Current Biology, 3, 141-148. 

Tao, T. (2008). Product set estimates for non-commutative groups. Com- 
binatorica, 28, 547-594. 

Tyson, G. W., Chapman, J., Hugenholtz, P., Allen, E. E., Ram, R. J., 
Richardson, P. M., Solovyev, V. V., Rubin, E. M., Rokhsar, D. S., & 
Banfield, J. F. (2004). Community structure and metabolism through 
reconstruction of microbial genomes from the environment. Nature, 428, 
37-43. 

Uhlmann, J. K. (1991). Satisfying general proximity/similarity queries with 
metric trees. Information processing letters, 40, 175-179. 

Ukkonen, E. (1985). Algorithms for approximate string matching. Informa¬ 
tion and control, 64, 100-118. 

Weber, R., Schek, H.-J., & Blott, S. (1998). A quantitative analysis and per¬ 
formance study for similarity-search methods in high-dimensional spaces. 
In VLDB (pp. 194-205). volume 98. 


28 



Yona, G., Linial, N., & Linial, M. (1999). Protomap: automatic classification 
of protein sequences, a hierarchy of protein families, and local maps of 
the protein space. Proteins: Structure, Function, and Bioinformatics, 37, 
360-378. 

Yu, Y. W., Yorukoglu, D., Peng, J., & Berger, B. (2015). Quality score 
compression improves genotyping accuracy. Nature Biotechnology, 33, 
240-243. 

Zezula, P., Amato, G., Dohnal, V., & Batko, M. (2006). Similarity search: 
the metric space approach volume 32. Springer Science & Business Media. 

Zhao, Y., Tang, H., & Ye, Y. (2012). RAPSearch2: a fast and memory- 
efficient protein similarity search tool for next-generation sequencing data. 
Bioinformatics, 28, 125-126. 


29 



arXiv:1503.05638v2 [cs.DS] 21 Sep 2015 


Supplemental Methods 

Theory 

Time-complexity 

We introduced the definition of the entropy-scaling similarity search data 
structure in Figure 1. For ease of analysis, we will work in a high-dimensional 
metric space and consider the database as a set D of n unique points in that 
metric space. Define Bg(q,r) = {p £ S : ||g — p\ \ < r}. The similarity 
search problem is thus to compute Bf)(q , r ) for a query q and radius r. Note 
however that the metricity requirement is needed only for a 100% sensitivity 
guarantee; other distance functions can be used, but result in some loss in 
sensitivity. However, regardless of the distance function chosen, there cannot 
be a loss of specificity; false positives will never be introduced because the 
fine search is just the original search function on a smaller subset of the 
database. 

A set C of k cluster centers are chosen such that no cluster has radius 
greater than a user-specified parameter r c and no two cluster centers are 
within distance r c of one another. The data structure then clusters the points 
in the set by assigning them to their nearest cluster center. Overloading 
notation a bit, we will identify each cluster with its center, so C is also 
the set of clusters. For a given similarity search query for all items within 
distance r of a query q. this data structure breaks the query into coarse 
and fine search stages. The coarse search is over the list of cluster centers, 
returning Bc{q , r + r c ). Let 


U 

c£B c (q,r+r c ) 

the union of all the returned clusters. By the Triangle Inequality, Bf)(q, r ) C 
F, which combined with F C D implies that Bp(q,r ) = Bf)(q,r). Thus, a 
fine search over the set F will return all items within radius r of q. 

Note that we require the metricity requirement only for the Triangle 
Inequality. It turns out that many interesting distance functions are not 
metrics, but still almost satisfy the Triangle Inequality, which is nearly suf¬ 
ficient. More precisely, if a fraction a of the triples in S do not satisfy the 
Triangle Inequality, then in expectation, we will have sensitivity 1 — a. As 
shown in the results, empirically, this loss in sensitivity appears to be low 
and can likely be ameliorated by increasing the coarse search radius. 

Provided the fractal dimension of the database is low, and given a few 
technical assumptions on the distribution of points, this data structure al- 
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lows for similarity search queries in time roughly linear in the metric en¬ 
tropy of the database. Additionally, without increasing the asymptotic 
time-complexity, this data structure can also be stored in an information 
theoretic entropy-compressed form. 

Note that entropy-scaling data structures are distinct from both succinct 
data structures and compressed data structures. Succinct data structures 
are ones that use space close to the information-theoretic limit in the worst 
case while permitting efficient queries; i.e. succinct data structures do not 
depend on the actual entropy of the underlying data set, but have size- 
dependence on the potential worst-case entropy of the data set (Jacobson, 
1988). Compressed (and opportunistic) data structures, on the other hand, 
bound the amount of the space used by the entropy of the data set while per¬ 
mitting efficient queries (Grossi & Vitter, 2005; Ferragina & Manzini, 2000). 
Entropy-scaling data structures are compressed data structures, but are dis¬ 
tinct, as unlike entropy-scaling data structures, compressed data structures 
do not measure time-complexity in terms of metric entropy. Additionally, 
existing compressed data structures such as the compressed suffix array and 
the FM-index are designed for the problem of pattern matching (Grossi & 
Vitter, 2005; Ferragina & Manzini, 2000). While related to similarity search, 
pattern matching does not admit as general of a notion of distance as the 
similarity search problem. While compressed sensing has also been applied 
to the problem of finding a representative set of genes for a collection of 
expression samples (Prat et ah, 2011), compressed sensing is distinct from 
entropy-scaling data structures. 

The primary advance of entropy-scaling data structures is that 
they bound both space and time as functions of the data set en¬ 
tropy (albeit using two different notions of entropy). 

Unifying different notions of fractal dimension 

In this paper, we make use of two different notions of fractal dimension, 
including the more intuitive formulation in the main paper and a more 
classical definition in this following section. In order to both unify these 
different notions of fractal dimension and to allow us to prove our complexity 
bounds, we make a technical assumption about the self-similarity of the 
data set. Roughly speaking, we require that the density of points be similar 
throughout the data set. 

More precisely, for a given radius, consider the number of neighbors 
about any point for that radius, which we will call the density around a 
point. We assume that for any given radius, the densities around all points 
are bounded within a constant multiplicative factor 7 of one another. This 
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particular technical assumption is likely stronger than we need, but makes 
the following arguments much more convenient; we conjecture but do not 
prove here that so long as the distribution of densities has low variance, 
all our statements below also hold with high probability. We give special 
thanks to one of our commenters for having brought this subtlety to our 
attention. 


Complexity bounds 

We first define the concept of metric entropy and entropy dimension in 
the classical manner: 


Definition 1 ((Tao, 2008) Definition 6.1). Let X be a metric space, let D 
be a subset of X, and let p > 0 be a radius. 


• The metric entropy N p (D) is the fewest number of points x\,... ,x n £ 
D such that the balls B(x\, p ),... B(x n , p) cover D. 


Definition 2 ((Falconer, 1990)). The Hausdorff dimension of a set D is 
given by 


dimHausdorff(-D) := lim 
p—¥0 


log Np{D) 

log 1/p 


Unfortunately, as D is a finite, discrete, set, the given definision always 
gives diniHausdorff(^) = 0. However, we are only interested in scaling behav¬ 
iors around large radii, so instead we use: 


Definition 3. Define fractal dimension d of a set D at a scale \p \, P 2 ] by 


d = max 

pe[pi,p 2 ] 


1 0£ MP1 

lQ g N P 1 (D) 

log 


Intuitively, this means that when we double the radii, the metric entropy, 
or number of covering spheres needed, decreases by a multiplicative factor of 
2 d . This definition of fractal dimension is classical, but the reader will note 
also different in formulation from the intuitive one given in the main text. 
However, the two notions are related given the bounded density assumption 
we made, from which immediately follows a bound on the number of points 
within each cluster. On average, when we double the radius of a sphere 
around a point, the number of points in the larger sphere is roughly the 
number of points in the smaller sphere multiplied by 2 d , because otherwise 
the spheres could not cover the space. This latter behavior is what we 
measure when we talk about local fractal dimension around a point in the 
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main paper. However, because the number of points within each cluster must 
remain within a multiplicative constant of each other by the bounded density 
assumption, not only is this true on average, but the increase in number of 
points within a cluster has to be roughly uniform across all clusters. Thus, 
this scaling behavior of points in a doubled sphere must be true everywhere 
in the data set, and thus we can connect the global average local fractal 
dimension alluded to in the main paper and measured for our data sets of 
interest to this more classical notion of fractal dimension based on covering 
spheres. 

Recall that k entries are selected as cluster centers for partitioning the 
database to result in clusters with maximum radius r c . From the definition 
above, when setting p = r c , it is trivial to verify k < N rc (D). This upper 
bound is guaranteed by our requirement that the cluster centers not be 
within distance r c . 

Given any query q, the coarse search over the cluster centers always 
requires k comparisons. Additionally, the fine search is over the set F, 
defined to be the union of clusters with centers within distance r + r c from 
q. As the time-complexity of similarity search is just the total of the coarse 
and fine searches, this implies that the total search time is 0{k + | F\). 

By the triangle inequality, F C Bjj(q,r + 2r c ), so we can bound |F| < 
\Bo{q, r + 2r c )|. Let the fractal dimension D at the scale between r c and 
2r c + r be d. Recall that the local fractal dimension determines how many 
more points we hit when we scale the radius of a sphere. Also, we use the 
density bound to give us that 

- \Bd(p,p)\ < IE q \B D (q,p)\ < 7 |-Bc(p,/))| 

7 

for any choice of point p and radius p. Then 

E q [\B D (q,r + 2r c )\\ < 7 \B D (p, r + 2r c )\ (1) 

< 7|-Br>(p,r)| ( r+ r 2rC ) ( 2 ) 

<7% [\B D (q,r)\\ , (3) 

because we are measuring the relative number of points found in spheres of 
radius r vs radius r + 2 r c respectively for some particular point p. But 7 is 
a constant, and disappears in asymptotic notation. Thus, total search time 
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is in expectation over points q 


O (*+!*»(»,r)l(^)'). 

However, note that k is linear in metric entropy and \Bjj(q, r)| is the output 
size, so similarity search can be performed in time linear to metric entropy 
and a polynomial factor of output size. Provided that the fractal dimension 
d is small and k is large, the search time will be dominated by the metric 
entropy component, which turns out to be the regime of greatest interest 
for us. We have thus proven bounds for the time-complexity of similarity 
search, given a self-similarity condition on the density of the dataset. 

Space-complexity 

Here we relate the space-complexity of our entropy-scaling similarity 
search data structure to information-theoretic entropy. Traditionally, information- 
theoretic entropy is a measure of the uncertainty of a distribution or random 
variable and is not well-defined for a finite database. However, the notion of 
information-theoretic entropy is often used in data compression as a short¬ 
hand for the number of bits needed to encode the database, or a measure 
of the randomness of that database. We use entropy in the former sense; 
precisely, we define the entropy of a database as the number of bits needed 
to encode that database, a standard practice in the field. Thus, we consider 
entropy-compressed forms of the original database, such as that obtained 
by Prediction by Partial Matching (PPM), Lempel-Ziv compression (e.g. 
Gzip), or a Burrows-Wheeler Transform (as in Bzip2), and use their size as 
an estimate of the entropy S or i g of the database. 

For all commonly used compression techniques, decompression time is 
linear in the size of the uncompressed data. Obviously, even with linear 
decompression, decompressing the entire database for each similarity search 
would squander the entropy-scaling benefits of our approach. However, note 
that the fine search detailed above only needs access to a subset of clusters 
and furthermore needs full access to that set of clusters. It is therefore 
always asymptotically ‘free’ to decompress an entire cluster at once, if any 
member of that cluster needs to accessed. Thus, one ready solution is to 
simply store entropy-compressed forms of each cluster separately. 

Compressing each cluster separately preserves runtime bounds, but makes 
it difficult to compare the compressed clustered database size to the original 
compressed database size. This results from the possibility that redundancy 
across clusters that would originally have been exploited by the compressor 
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can no longer be exploited once the database is partitioned. Intuitively, for 
any fixed-window or block compressor, grouping together similar items into 
clusters should increase the performance of the compressor, but it is unclear 
a priori if that balances out the loss of redundancy across clusters. 

A somewhat more sophisticated solution is to reorder the entries of the 
database by cluster, compress the entire database, and then store indexes 
into the starting offset of each cluster. For popular tools such as Gzip 
or Bzip2, this is possible with constant overhead k per index. Because 
the entire database is still being compressed, redundancy across clusters 
can be exploited to reduce compressed size, while still taking advantage of 
similar items being grouped together. Thus, in expectation over uniformly- 
randomly chosen orderings of the database entries (obviously, there is some 
optimal ordering, but computing that is computationally infeasible), the 
compressed clustered database size S c i us t < S or i g . Then, total expected 
space-complexity of our data structure is 0{nk + S or i g )] recall here that k is 
the number of clusters and is bounded by the metric entropy of the database. 
Thus, space complexity is linear in metric entropy plus information-theoretic 
entropy. 

Additionally, given that our distance function measures marginal information- 
theoretic entropies, we can also give a bound on the total information- 
theoretic entropy of the database by using metric entropy and the cluster 
radius. Let l be the maximum distance of two points in the space. The 
naive upper bound on total entropy is then 0(nl), where n is the total 
number of points in the database, because distance and entropy are re¬ 
lated. Recall that we chose k points as cluster centers, where k is bounded 
by metric entropy, for a maximum cluster radius r c . Encoding each non¬ 
center point p as a function of the nearest cluster center requires 0(nr c ) 
bits. Specifying the privileged points again requires O(kl) bits, so together 
the total information-theoretic entropy is 0{kl + nr c ). In other words, not 
only is space complexity linear in metric entropy plus information-theoretic 
entropy, but information-theoretic entropy itself is also bounded by the low¬ 
dimensional coarse structure of the database. 

Clustering time complexity 

Although clustering the database is a one-time cost that can further be 
amortized over future queries, we still require that cluster generation be 
tractable. Here we present a trivial 0(kn ) algorithm for cluster generation 
with clusters of maximum radius r c that appears to work sufficiently well in 
practice (and is the algorithm used in esFragBag): 
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• Initialize an empty set of cluster centers C. Let S(x, C ) be the distance 
from a point x to C, defined to be oo if C = 0. 

• Randomly order the n entries of the database D = {cii,..., d n } 

• For i = 1,... n, 

— If 5(dj , C) > r c , append dj to C. 

• For i = 1,... n, 

— Assign di to the cluster represented by the nearest item in C. 

Because we need to compare each of n items against up to k items in C in 
each of the for loops, this trivial algorithm takes 0{kri) time. 

Additionally, insertions can be performed in 0{k ) time. For a new entry 
d n+ 1 , if 5(d n+ 1 , C ) < r c , assign d n+ \ to the cluster represented by the nearest 
item in C. Otherwise, append d n+ \ to C as a new cluster center. This clearly 
requires exactly k comparisons to do. Note that with insertions of this kind, 
items are no longer guaranteed to be assigned to the nearest cluster center; 
however, they are still guaranteed to be assigned to some cluster center 
within distance r c , which is all that is needed for entropy-scaling to work. 

Deletions are slightly more complicated. If the entry to be deleted is not 
a cluster center, then removing it takes constant time. However, if it is a 
cluster center, we effectively have to remove the entire cluster and reinsert 
all the non-center elements, which will take 0(k ■ [size of cluster]). Thus, in 
expectation over a uniform random choice of item to be deleted, deletions 
can also be performed in 0{k ) time. 

Ammolite 

Simplification and compression 

Given a molecular graph, any vertex or edge that is not part of a simple 
cycle or a tree is removed, and any edge that is part of a tree is removed. 
This preserves the node count, but not the topology, of tree-like structures, 
and preserves simple cycles, which represent rings in chemical compounds. 
For example, as shown in Figure S2, both caffeine and adenine would be 
reduced to a purine-like graph. 

After this transformation is applied to each molecule in a database to be 
compressed, we identify all clusters of fully-isomorphic transformed molec¬ 
ular graphs. Isomorphism detection is performed using the VF2 (Cordelia 
et ah, 2001) algorithm; a simple hash computed from the number of ver¬ 
tices and edges in each transformed molecular graph is first used to filter 
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molecular graphs that cannot possibly be isomorphic. A representative from 
each such cluster is stored in SDF format; collectively, these representatives 
form a “coarse” database. Along with each representative, we preserve the 
information necessary to reconstruct each original molecule, as a pointer to 
a set of vertices and edges that have been removed or unlabeled. 

Ammolite is implemented in Java, and its source code is available on 
Github. 

MICA 

Alphabet Reduction 

Alphabet reduction- reducing the 20-letter standard amino acid alpha¬ 
bet to a smaller set, in order to accelerate search or improve homology 
detection—has been proposed and implemented several times (Bacardit et al., 
2007; Peterson et ah, 2009). In particular, Murphy et al. (2000) considered 
reducing the amino-acid alphabet to 17, 10, or even 4 letters. More recently, 
Zhao et al. (2012) and Huson & Xie (2013) applied a reduction to a 4-letter 
alphabet, termed a “pseudoDNA” alphabet, in sequence alignment. 

When using BLASTX for coarse search (which we call caBLASTX), we 
extend the compression approach of Daniels et al. (2013) using a reversible 
alphabet reduction. We use the alphabet reduction of Murphy et al. (2000) 
to map the standard amino acid alphabet (along with the four common 
ambiguous letters ) onto a 4-letter alphabet. Specifically, we map F, W, 
and Y into one cluster; C, I, L, M, V, and J into a second cluster, A, G, 
P, S, and T into a third cluster, and D, E, N, Q, K, R, H, B, and Z into 
a fourth cluster. By storing the offset of the original letter within each 
cluster, the original sequence can be reconstructed, making this a reversible 
reduction. This alphabet reduction is not used when using DIAMOND for 
coarse search, as DIAMOND already relies on its own alphabet reduction. 

Database Compression 

Given a protein sequence database to be compressed, we proceed as 
follows: 

1. First, initialize a table of all possible k-iner seeds of our (possibly 4- 
letter reduced) alphabet, as well as a coarse database of sequences, 
initially containing the (possibly reduced-alphabet) first sequence in 
the input database. 

2. For each A:-mer of the first sequence, store its position in the corre¬ 
sponding entry in the seed table. 



3. For each subsequent sequence s in the input, reduce its alphabet and 
slide a window of length k along the sequence, skipping single-letter 
repeats of length greater than 10. 

4. (a) Look up these k residues in the seed table. For every entry match¬ 

ing to that /c-mer in the seed table, follow it to a corresponding 
subsequence in the coarse database and attempt extension (de¬ 
fined below). If no subsequences from this window can be ex¬ 
tended, move the window by m positions, where m defaults to 
20 . 

(b) If a match was found via extension, move the /c-mer window to 
the first /c-mer in s after the match, and repeat the extension 
process. 

Given a /c-mer in common between sequence s and a subsequence s' 
pointed to by the seed table, first attempt ungapped extension: 

1. Within each window of length m beginning with a /c-mer match, if 
there are at least 60% matches between s and s', then there is an 
ungapped match. 

2. Continue ungapped matching using m-mer windows until no more '/ri¬ 
mers of at least 60% sequence identity are found. 

3. The result of ungapped extension is that there is an alignment between 
s and s' where the only differences are substitutions, at least 60% of 
the positions contain exact matches. 

When ungapped extension terminates, attempt gapped extension. From 
the end of the aligned regions thus far, align 25-rner windows of both s and s' 
using the Needleman-Wunsch (Needleman & Wunsch, 1970) algorithm using 
an identity matrix. Note that the original caBLASTP (Daniels et ah, 2013) 
used BLOSUM62 as it was operating in amino acid space; as we are now 
operating in a reduced-alphabet space, an identity matrix is appropriate, 
just as it is for nucleotide space. After gapped extension on a window length 
of 25, attempt ungapped extension again. 

If neither gapped nor ungapped extension can continue, end the exten¬ 
sion phase. If the resulting alignment has less than 70% sequence identity 
(in the reduced-alphabet space), or is shorter than 40 residues, discard it, 
and attempt extension on the next entry in the seed table for the original 
/c-mer, continuing on to the next /c-mer if there are no more entries. 

If the resulting alignment does have at least 70% sequence identity in 
the reduced-alphabet space, and is at least 40 residues long, then create 
a link from the entry for s' in the coarse database to the subsequence of s 
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corresponding to the alignment. If there are unaligned ends of s shorter than 
30 residues, append them to the match. Longer unaligned ends that did not 
match any subsequences reachable from the seed table are added into the 
coarse database themselves, following the same L-nrer indexing procedure as 
the first sequence. 

Finally, in order to be able to recover the original sequence with its 
original amino acid identities, a difference script is associated with each link. 
This difference script is a representation of the insertions, deletions, and 
substitutions resulting from the Needleman-Wunsch alignment, along with 
(if alphabet reduction is used) the offset in each reduced-alphabet cluster 
needed to recover the original alphabet. Thus, for example, a valine (V) is 
in the cluster containing C, I, L, M, V, and J. Since it is the 4th entry in 
that 5-entry cluster, we can represent it with the offset 4. Since the largest 
cluster contains 9 elements, only four bits are needed to store one entry 
in the difference script. More balanced clusters would have allowed 3-bit 
storage, but at the expense of clusters that less faithfully represented the 
BLOSUM62 matrix and the physicochemical properties of the constituent 
amino acids. 

Because of the seed table, compression is memory-intensive and CPU¬ 
intensive. Compressing the September, 2014 NCBI NR database required 
approximately 39 hours on a 12-core Xeon with 128GB RAM. 

Query Clustering 

Metagenomic reads are themselves nucleotide sequences, so no alphabet 
reduction is performed on them directly. When BLASTX is used for coarse 
search (caBLASTX), MICA relies on query-side clustering. Metagenomic 
reads are compressed using the same approach as the protein database, 
without the alphabet reduction step and with a number of different parame¬ 
ters. The difference scripts for metagenomic reads do not rely on the cluster 
offsets, but simply store the substituted nucleotides. 

Furthermore, unlike protein databases, where most typical sequences 
range in length from 100 to over 1000 amino acids, next-generation sequenc¬ 
ing reads are typically short and usually of fixed length, which is known 
in advance. Thus, the minimum alignment length required for a match, 
and the maximum length unaligned fragment to append to a match, require 
different values based on the read length. 

An additional complication is that insertions and deletions from one read 
to another will change the reading frame, potentially resulting in different 
amino acid sequences. For this reason, query clustering requires long, un¬ 
gapped windows of high sequence identity. Specifically, for 202-nucleotide 
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reads, for two sequences to cluster together, we require a 150-nucleotide 
ungapped region of at least 80% sequence identity. 

We note that unlike the compression of the database, which can be amor¬ 
tized over future queries, the time spent clustering and compressing the 
queries cannot be amortized. Thus, we would not refer to the query clus¬ 
tering as entropy-scaling, but it still provides a constant speed-up. For this 
reason, we include the time spent clustering and compressing queries in the 
search time for MICA. When using DIAMOND for coarse search, MICA 
does not perform query-side clustering, and instead relies on DIAMOND’S 
indexing of the queries. 

Search 

Given a compressed protein database and a compressed query read set, 
search comprises two phases. The first, coarse search , considers only the 
coarse sequences—the representatives—resulting from compression of the 
protein database and the query set. When BLASTX is used for the coarse 
search (caBLASTX), each coarse nucleotide read is transformed into each of 
the six possible amino acid sequences that could result from it (three reading 
frames for both the sequence and its reverse complement). Then, each of 
these amino acid sequences is then reduced back to a four-letter alphabet 
using the same mapping as for protein database compression. For conve¬ 
nience, the four-letter alphabet is represented using the standard nucleotide 
bases, though this has no particular biological significance. This is done so 
that the coarse search can rely on BLASTN (nucleotide BLAST) to search 
these sequences against the compressed protein database. 

For each coarse query representative (identified using a a coarse E- 
value of 1000, along with the BLASTN arguments -task blastn-short 
-penalty -1; these arguments are recommended by the NCBI BLAST+ 
manual when queries are short), the set of coarse hits is used to reconstruct 
all corresponding sequences from the original database by following links 
to original sequence matches and applying their difference scripts. The re¬ 
sulting candidates are thus original sequences from the protein database, in 
their original amino acid alphabet. The query representative is also used to 
reconstruct all corresponding sequences from the original read set. Thus, for 
each coarse query representative, there is now a subset of the metagenomic 
read set (the reads represented by that coarse query) and also a subset of 
the protein database (the candidates). 

When DIAMOND is used for coarse search, instead of an E-value thresh¬ 
old, the argument —top 60 is used; this causes DIAMOND to return all 
coarse hits whose score is within 60% of the top-scoring hit. Without this 
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argument, DIAMOND defaults to returning at most 25 hits for each query 
sequence, which would result in significant loss of recall. 

The second phase, fine search , uses standard DIAMOND or BLASTX to 
translate each of these reads associated with a coarse query representative 
and search for hits only in the subset of the database comprising the candi¬ 
dates. This fine search phase relies on a user-specified E-value threshold (or 
other user-specified parameters to DIAMOND or BLASTX) to filter hits. 

To ensure that E-value calculation is correct, the call to BLASTX uses a cor¬ 
rected database size which is the size of the original, uncompressed protein 
database. 

Benchmarking 

Although our primary result is the direct acceleration of DIAMOND us¬ 
ing our entropy-scaling data structures, we also compared MICA to RapSearch2 (Zhao 
et ah, 2012) version 2.22 and the November 29, 2014 version of DIAMOND (Buchhnk 
et al., 2015). All tests were performed on a 12-core Intel Xeon X5690 running 
at 3.47GHz with 88GB RAM and hyperthreading; 24 threads were allowed 
for all programs. Diamond was run with the —sensitive option. In all 
cases, an E-value threshold of le-7 was used. 

For the raw-read dataset, we filtered out reads starting or ending with 
10 or more no-calls (’N’). 

MICA is implemented in Go, and its source code is available on Github. 

esFragBag 

We took the existing FragBag method as a black box and by design 
did not do anything clever in esFragBag except apply the entropy-scaling 
similarity search data structure. We used a Go language implementation 
of FragBag, written by Andrew Gallant. Additionally, we removed the 
sorting-by-distance feature of Andrew Gallant’s FragBag search implemen¬ 
tation, which does not improve the all-matching results we were interested in 
here—it lowers k -nearest neighbor search memory requirements while dom¬ 
inating the running time of p-nearest neighbor, the problem at hand. This 
was done for both the FragBag and the esFragBag benchmarks, to ensure 
comparability. All code was written in Go, and is available on Github. 

The entire 2014 Oct 31 version of the Protein Data Bank was down¬ 
loaded and the database was composed of fragment frequency vectors gen¬ 
erated from all of the relevant PDB files using the 400-1 l.json fragment list 
(Budowski-Tal et ah, 2010). For this paper, we implemented the benchmark¬ 
ing in Go, and have provided the source code for the benchmarking routine 
on Github. This allowed us to benchmark just the search time, excluding 
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the time to load the database from disk. Note that the prototype imple¬ 
mentation of esFragBag available only supports the all p-nearest neighbor 
search query found in FragBag. 
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Supplemental Figures 


51 Genomic data available has grown at a faster exponential 
rate than computer processing power and disk storage. These 
plots represent, on a log scale, the daily growth in sequence 
data from GenBank along with (a) the combined computing 
power (in TeraFLOPs) of the Top 500 Supercomputer list, 
and (b) the largest commercially-available hard disk drives. 

52 (Related to Table lb) Ammolite’s preprocessing during the 
clustering phase. Ammolite removes nodes and edges that do 
not participate in simple cycles, and treats all edges as simple, 
unlabeled edges. In this example, both caffeine and adenine 
become a purine-like graph structure. Note that the resulting 
graph has no implicit hydrogens. 

53 (Related to Figure 3) Local fractal dimension at different 
scales for the space of PDB FragBag frequency vectors. Each 
data point is defined by dimension d = l \og(rJ/’rl) 1 w ^ere ri \, ri 2 
are the number of similarity search hits within radius respec¬ 
tively ri, r 2 , and r 2 — rq is the increment size of 0.01 for cosine 
distance and 1 for euclidean distance. In most regimes, local 
fractal dimension is consistently low, except for a large spike 
when radius expands to include the central cluster of proteins. 
esFragBag achieves the most acceleration when both output 
size is small and we remain in a low fractal dimension regime. 
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Computing Power, TeraFLOPS 



( a ) 



Figure SI: Genomic data available has grown at a faster exponential rate 
than computer processing power and disk storage. These plots represent, 
on a log scale, the daily growth in sequence data from GenBank along with 
(a) the combined computing power (in TeraFLOPs) of the Top 500 Super¬ 
computer list, and (b) the largest commercially-available hard disk drives. 
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Genomic Data, KiloBase-pairs per day 









Figure S2: (Related to Table lb) Ammolite’s preprocessing during the 
clustering phase. Ammolite removes nodes and edges that do not participate 
in simple cycles, and treats all edges as simple, unlabeled edges. In this 
example, both caffeine and adenine become a purine-like graph structure. 
Note that the resulting graph has no implicit hydrogens. 
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Fractal dimension Fractal dimension Fractal dimension 


(a) Cosine distance 


(b) Euclidean distance 




Figure S3: (Related to Figure 3) Local fractal dimension at different 
scales for the space of PDB FragBag frequency vectors. Each data point is 
defined by dimension d = ^og^/ri) > w ^ere n\, ri 2 are the number of similar¬ 
ity search hits within radius respectively rq, r ?, and r 2 — r\ is the increment 
size of 0.01 for cosine distance and 1 for euclidean distance. In most regimes, 
local fractal dimension is consistently low, except for a large spike when ra¬ 
dius expands to include the central cluster of proteins. esFragBag achieves 
the most acceleration when both output size is small and we remain in a 
low fractal dimension regime. 






































































